Kruskal–Wallis H test (one-way ANOVA on ranks) — from scratch#

The Kruskal–Wallis test answers:

“Do two or more independent groups differ in their typical values when I don’t want to assume normality?”

It is a nonparametric alternative to one-way ANOVA. Instead of comparing means, it ranks all observations and checks whether some groups systematically receive higher (or lower) ranks.

Learning goals#

  • Know when Kruskal–Wallis is the right test (and when it isn’t).

  • Understand the null/alternative hypotheses and the assumptions.

  • Compute the H statistic from ranks and rank sums.

  • Implement the test with NumPy only (including tie correction).

  • Interpret results (p-value + effect size) and visualize what’s happening with Plotly.

Notebook roadmap#

  1. What the test is used for

  2. Hypotheses + assumptions

  3. Intuition: “ANOVA on ranks”

  4. The H statistic (with tie correction)

  5. A tiny worked example (ranks by hand)

  6. NumPy-only implementation

  7. A realistic simulation + Plotly visuals

  8. Tie correction demo

  9. Interpretation + reporting

  10. Pitfalls

  11. Exercises + references

import os
import sys

import numpy as np
import plotly
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

SEED = 7
rng = np.random.default_rng(SEED)

print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("Plotly:", plotly.__version__)
print("Seed:", SEED)
Python: 3.12.9
NumPy: 1.26.2
Plotly: 6.5.2
Seed: 7

1) What is the Kruskal–Wallis test used for?#

Use Kruskal–Wallis when you have:

  • One categorical factor with 2+ independent groups (A/B/C/…)

  • A numeric or ordinal outcome

  • You want a test that does not assume normality

Common situations:

  • A/B/C testing where the metric is skewed (e.g., time-on-page).

  • Comparing ratings (Likert-scale) across multiple groups.

  • Medical/biological measurements that are heavy-tailed or contain outliers.

Relationship to other tests:

  • One-way ANOVA: compares means assuming (roughly) normal residuals.

  • Mann–Whitney U: compares two independent groups using ranks.

    • Kruskal–Wallis is the natural extension to k groups.

  • Friedman test: the rank-based analogue for repeated-measures / paired designs (not Kruskal–Wallis).
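The link to Mann–Whitney can be checked numerically: with \(k=2\) and no ties, the Kruskal–Wallis \(H\) equals the square of the Mann–Whitney normal-approximation \(z\) (without continuity correction). A NumPy-only sanity check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=12)
y = rng.normal(loc=0.5, size=15)

# pooled ranks (continuous draws, so no ties)
pooled = np.concatenate([x, y])
ranks = np.empty(pooled.size)
ranks[np.argsort(pooled)] = np.arange(1, pooled.size + 1)

n1, n2, N = x.size, y.size, pooled.size
r1 = ranks[:n1].sum()
r2 = ranks[n1:].sum()

# Kruskal–Wallis H with k = 2
H = 12.0 / (N * (N + 1)) * (r1**2 / n1 + r2**2 / n2) - 3.0 * (N + 1)

# Mann–Whitney U and its normal-approximation z (no continuity correction)
U = r1 - n1 * (n1 + 1) / 2.0
z = (U - n1 * n2 / 2.0) / np.sqrt(n1 * n2 * (N + 1) / 12.0)

print(H, z**2)  # agree up to floating-point error
```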

2) Hypotheses and assumptions#

2.1 Hypotheses#

Let there be \(k\) groups.

  • Null hypothesis \(H_0\): the groups come from the same distribution (equivalently: group label has no effect on the outcome).

  • Alternative \(H_1\): at least one group tends to produce larger or smaller values.

Important nuance:

  • Kruskal–Wallis is often described as a test of “medians”, but strictly it detects distributional differences.

  • If group distributions have similar shapes/spreads, then a significant result is commonly interpreted as a difference in location (typical value).

2.2 Assumptions (practical)#

  • Observations are independent within and across groups.

  • The outcome is at least ordinal (ranks make sense).

  • Groups are independent (between-subject design).

If independence is violated, the p-value can be severely wrong.

3) Intuition: “ANOVA on ranks”#

  1. Pool all observations from all groups.

  2. Replace the raw values with their ranks (1 = smallest).

  3. Compute each group’s sum of ranks (or mean rank).

If all groups really come from the same distribution, their ranks should be well-mixed and each group’s average rank should be close to the overall average rank:

\[ \text{overall mean rank} = \frac{N + 1}{2} \]

where \(N\) is the total number of observations.

If one group tends to have larger values, it will accumulate larger ranks → the rank sums become very different → the test statistic grows.
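A quick numerical illustration of the baseline: without ties, the pooled ranks are a permutation of \(1,\dots,N\), so their overall mean is \((N+1)/2\) no matter how the data are grouped — only the per-group deviations from this baseline carry signal.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.exponential(size=10)  # continuous, so no ties

# rank the pooled sample: 1 = smallest
ranks = np.empty(values.size)
ranks[np.argsort(values)] = np.arange(1, values.size + 1)

N = values.size
print(ranks.mean(), (N + 1) / 2)  # both 5.5
```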

4) The H statistic (with tie correction)#

Let:

  • \(k\) = number of groups

  • \(n_i\) = sample size of group \(i\)

  • \(N = \sum_{i=1}^k n_i\) = total sample size

  • \(R_i\) = sum of ranks in group \(i\) (ranks computed on the pooled sample)

The Kruskal–Wallis statistic is:

\[ H = \frac{12}{N(N+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(N+1) \]

Tie correction#

If there are ties in the pooled data, ranks are averaged. Ties reduce the variance of ranks, so we apply a correction:

\[ C = 1 - \frac{\sum_j (t_j^3 - t_j)}{N^3 - N} \]

where \(t_j\) are the sizes of tied groups (e.g., if the value 10 appears 4 times, one of the \(t_j\) is 4).
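For example, with pooled values \([1, 2, 2, 3, 3, 3]\) the tie-group sizes are \(t = (2, 3)\), and \(C\) can be computed directly:

```python
import numpy as np

pooled = np.array([1, 2, 2, 3, 3, 3])
_, counts = np.unique(pooled, return_counts=True)
t = counts[counts > 1]                   # tie-group sizes: [2, 3]

N = pooled.size
C = 1.0 - np.sum(t**3 - t) / (N**3 - N)  # 1 - (6 + 24) / 210
print(C)  # ≈ 0.8571
```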

The tie-corrected statistic is:

\[ H_{\text{corr}} = \frac{H}{C} \]

Under \(H_0\) and with sufficiently large samples, \(H_{\text{corr}}\) is approximately \(\chi^2\) distributed with \(k-1\) degrees of freedom.

In this notebook we compute H exactly (NumPy-only) and estimate the p-value via a permutation test (also NumPy-only).
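If SciPy happens to be installed, the large-sample \(\chi^2\) approximation is a one-line cross-check on the permutation p-value (the \(H\) and \(k\) below are illustrative values, not results from this notebook):

```python
import numpy as np
from scipy.stats import chi2  # used here only as a cross-check

H_corr = 9.73  # example tie-corrected H statistic (hypothetical)
k = 3          # number of groups
p_approx = chi2.sf(H_corr, df=k - 1)
print(p_approx)  # ≈ 0.0077
```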

5) Tiny worked example (see ranks explicitly)#

We’ll use three small groups so we can see the mechanics clearly.

g1 = np.array([1, 3, 5])
g2 = np.array([2, 4, 6])
g3 = np.array([7, 8, 9])

groups_tiny = [g1, g2, g3]
group_names_tiny = ["A", "B", "C"]
def rankdata_average(x: np.ndarray):
    """Average ranks for ties (NumPy-only).

    Returns
    - ranks: float array of same shape as x (flattened)
    - tie_counts: 1D int array containing sizes of tie groups (>1)
    """
    x = np.asarray(x)
    x = x.ravel()
    n = x.size
    if n == 0:
        raise ValueError("x must be non-empty")

    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]

    ranks_sorted = np.empty(n, dtype=float)
    tie_counts = []

    diffs = np.diff(sorted_x)
    run_starts = np.r_[0, np.nonzero(diffs != 0)[0] + 1]
    run_ends = np.r_[run_starts[1:], n]

    for start, end in zip(run_starts, run_ends):
        avg_rank = (start + 1 + end) / 2.0  # ranks are 1..n
        ranks_sorted[start:end] = avg_rank
        count = end - start
        if count > 1:
            tie_counts.append(count)

    ranks = np.empty(n, dtype=float)
    ranks[order] = ranks_sorted
    return ranks, np.array(tie_counts, dtype=int)
all_values_tiny = np.concatenate(groups_tiny)
all_ranks_tiny, tie_counts_tiny = rankdata_average(all_values_tiny)

labels_tiny = np.concatenate([
    np.repeat(name, len(g)) for name, g in zip(group_names_tiny, groups_tiny)
])

table = np.column_stack([labels_tiny, all_values_tiny.astype(str), all_ranks_tiny.astype(str)])
print("group  value  rank")
for row in table:
    print(f"{row[0]:>5}  {row[1]:>5}  {row[2]:>4}")

print("\nTies:", tie_counts_tiny if tie_counts_tiny.size else "none")
group  value  rank
    A      1   1.0
    A      3   3.0
    A      5   5.0
    B      2   2.0
    B      4   4.0
    B      6   6.0
    C      7   7.0
    C      8   8.0
    C      9   9.0

Ties: none
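If SciPy is available, `scipy.stats.rankdata(method="average")` implements the same tie-averaging rule as our `rankdata_average`, which makes a handy spot check on a deliberately tied array:

```python
import numpy as np
from scipy.stats import rankdata  # cross-check only

x = np.array([3.0, 1.0, 3.0, 2.0, 3.0])
r = rankdata(x, method="average")
print(r)  # [4. 1. 4. 2. 4.] — the three 3.0s share rank (3+4+5)/3 = 4
```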
# Visualize mean ranks (this is what the test is sensitive to)
starts = np.r_[0, np.cumsum([len(g) for g in groups_tiny])[:-1]]
rank_sums_tiny = np.add.reduceat(all_ranks_tiny, starts)
ns_tiny = np.array([len(g) for g in groups_tiny])
mean_ranks_tiny = rank_sums_tiny / ns_tiny
overall_mean_rank_tiny = (len(all_ranks_tiny) + 1) / 2

fig = go.Figure(
    data=[go.Bar(x=group_names_tiny, y=mean_ranks_tiny, text=np.round(mean_ranks_tiny, 2), textposition="outside")]
)
fig.add_hline(y=overall_mean_rank_tiny, line_dash="dash", line_color="gray", annotation_text="overall mean rank")
fig.update_layout(
    title="Tiny example: mean ranks by group",
    xaxis_title="Group",
    yaxis_title="Mean rank",
    yaxis=dict(range=[0, len(all_ranks_tiny) + 1]),
)
fig.show()

6) NumPy-only Kruskal–Wallis implementation#

We’ll implement:

  • computing ranks (average ties)

  • computing \(H\) and tie correction

  • estimating the p-value via a permutation test

The permutation test is conceptually simple:

  • Under \(H_0\) the group labels are exchangeable.

  • Shuffle which observations belong to which group (keeping group sizes fixed).

  • Recompute \(H\) each time.

  • The p-value is the fraction of shuffled datasets where \(H_{\text{perm}} \ge H_{\text{obs}}\).

def kruskal_wallis_h(*groups: np.ndarray):
    """Compute Kruskal–Wallis H statistic (tie-corrected), NumPy-only."""
    if len(groups) < 2:
        raise ValueError("Need at least two groups")

    groups = [np.asarray(g).ravel().astype(float) for g in groups]
    if any(g.size == 0 for g in groups):
        raise ValueError("All groups must be non-empty")

    ns = np.array([g.size for g in groups], dtype=int)
    k = len(groups)
    n_total = int(ns.sum())

    pooled = np.concatenate(groups)
    ranks, tie_counts = rankdata_average(pooled)

    starts = np.r_[0, np.cumsum(ns)[:-1]]
    rank_sums = np.add.reduceat(ranks, starts)

    H = (12.0 / (n_total * (n_total + 1.0))) * np.sum((rank_sums**2) / ns) - 3.0 * (n_total + 1.0)

    if tie_counts.size:
        tie_term = np.sum(tie_counts**3 - tie_counts)
        C = 1.0 - tie_term / (n_total**3 - n_total)
    else:
        C = 1.0

    H_corr = H / C

    mean_ranks = rank_sums / ns
    df = k - 1

    # A common effect size for Kruskal–Wallis; note that this
    # (H - k + 1) / (N - k) form is labeled epsilon-squared in some
    # references and eta-squared in others.
    epsilon_sq = max(0.0, (H_corr - (k - 1)) / (n_total - k)) if n_total > k else np.nan

    details = {
        "H": float(H_corr),
        "H_uncorrected": float(H),
        "tie_correction_C": float(C),
        "df": int(df),
        "n_total": int(n_total),
        "group_sizes": ns,
        "rank_sums": rank_sums,
        "mean_ranks": mean_ranks,
        "epsilon_squared": float(epsilon_sq),
    }
    return details


def kruskal_wallis_permutation_test(*groups: np.ndarray, n_perm: int = 5000, seed: int = 0):
    """Permutation p-value for Kruskal–Wallis (NumPy-only).

    Returns
    - p_value
    - H_obs
    - H_perm: permutation distribution
    """
    if n_perm <= 0:
        raise ValueError("n_perm must be positive")

    groups = [np.asarray(g).ravel().astype(float) for g in groups]
    ns = np.array([g.size for g in groups], dtype=int)
    if any(ns == 0):
        raise ValueError("All groups must be non-empty")

    pooled = np.concatenate(groups)
    ranks, tie_counts = rankdata_average(pooled)
    n_total = ranks.size
    k = len(groups)
    starts = np.r_[0, np.cumsum(ns)[:-1]]

    # tie correction is constant under permutation (pooled values fixed)
    if tie_counts.size:
        tie_term = np.sum(tie_counts**3 - tie_counts)
        C = 1.0 - tie_term / (n_total**3 - n_total)
    else:
        C = 1.0

    rank_sums_obs = np.add.reduceat(ranks, starts)
    H_obs = (12.0 / (n_total * (n_total + 1.0))) * np.sum((rank_sums_obs**2) / ns) - 3.0 * (n_total + 1.0)
    H_obs /= C

    rng_local = np.random.default_rng(seed)
    H_perm = np.empty(n_perm, dtype=float)

    for i in range(n_perm):
        shuffled = rng_local.permutation(ranks)
        rank_sums = np.add.reduceat(shuffled, starts)
        H = (12.0 / (n_total * (n_total + 1.0))) * np.sum((rank_sums**2) / ns) - 3.0 * (n_total + 1.0)
        H_perm[i] = H / C

    # +1 correction avoids p=0 and is common for Monte Carlo/permutation tests
    p_value = (1.0 + np.sum(H_perm >= H_obs)) / (n_perm + 1.0)

    return float(p_value), float(H_obs), H_perm
details_tiny = kruskal_wallis_h(*groups_tiny)
p_tiny, H_obs_tiny, H_perm_tiny = kruskal_wallis_permutation_test(*groups_tiny, n_perm=5000, seed=SEED)

print("Kruskal–Wallis (tiny example)")
print("H (tie-corrected):", details_tiny["H"])
print("df:", details_tiny["df"])
print("Permutation p-value:", p_tiny)
print("Mean ranks:", np.round(details_tiny["mean_ranks"], 3))
print("Epsilon^2 (effect size):", np.round(details_tiny["epsilon_squared"], 4))
Kruskal–Wallis (tiny example)
H (tie-corrected): 5.600000000000001
df: 2
Permutation p-value: 0.05398920215956809
Mean ranks: [3. 4. 8.]
Epsilon^2 (effect size): 0.6
fig = go.Figure()
fig.add_trace(go.Histogram(x=H_perm_tiny, nbinsx=40, name="H under H0 (permutations)", opacity=0.75))
fig.add_vline(x=H_obs_tiny, line_color="crimson", line_width=3, annotation_text="observed H", annotation_position="top")
fig.update_layout(
    title=f"Tiny example: permutation distribution of H (p ≈ {p_tiny:.4f})",
    xaxis_title="H statistic",
    yaxis_title="Count",
)
fig.show()

Interpreting the tiny example#

  • A small p-value means: datasets like this are rare under the null model where group labels do not matter.

  • Kruskal–Wallis is an omnibus test:

    • it can tell you “something differs”

    • it does not tell you which pairs differ (that requires post-hoc tests)

7) A more realistic example + Plotly visuals#

We’ll simulate three groups with skewed (log-normal-like) outcomes.

  • Group A and B have similar typical values

  • Group C is shifted upward

This setup is useful because:

  • one-way ANOVA relies on (roughly) normal residuals

  • Kruskal–Wallis is robust to skew/outliers because it works on ranks

n = 60

# Skewed data: exponentiate normal noise (log-normal-ish)
A = np.exp(rng.normal(loc=0.0, scale=0.6, size=n))
B = np.exp(rng.normal(loc=0.05, scale=0.6, size=n))
C = np.exp(rng.normal(loc=0.35, scale=0.6, size=n))

groups = [A, B, C]
group_names = ["A", "B", "C"]
# Violin plot of raw values (skew + group shift)
fig = go.Figure()
for name, values in zip(group_names, groups):
    fig.add_trace(
        go.Violin(
            y=values,
            name=name,
            box_visible=True,
            meanline_visible=True,
            points="all",
            jitter=0.25,
        )
    )

fig.update_layout(
    title="Simulated skewed data: raw values by group",
    xaxis_title="Group",
    yaxis_title="Outcome value",
)
fig.show()
details = kruskal_wallis_h(*groups)
p_value, H_obs, H_perm = kruskal_wallis_permutation_test(*groups, n_perm=8000, seed=SEED)

print("Kruskal–Wallis (simulation)")
print("H (tie-corrected):", np.round(details["H"], 4))
print("df:", details["df"])
print("Permutation p-value:", np.round(p_value, 6))
print("Mean ranks:", np.round(details["mean_ranks"], 3))
print("Epsilon^2 (effect size):", np.round(details["epsilon_squared"], 4))
Kruskal–Wallis (simulation)
H (tie-corrected): 9.7284
df: 2
Permutation p-value: 0.007499
Mean ranks: [ 77.083  87.983 106.433]
Epsilon^2 (effect size): 0.0437
# Mean ranks summarize the rank shift between groups
overall_mean_rank = (details["n_total"] + 1) / 2

fig = go.Figure(
    data=[
        go.Bar(
            x=group_names,
            y=details["mean_ranks"],
            text=np.round(details["mean_ranks"], 2),
            textposition="outside",
        )
    ]
)
fig.add_hline(y=overall_mean_rank, line_dash="dash", line_color="gray", annotation_text="overall mean rank")
fig.update_layout(
    title="Simulation: mean ranks by group (rank-based signal)",
    xaxis_title="Group",
    yaxis_title="Mean rank",
)
fig.show()
# Visualize rank distributions directly
pooled = np.concatenate(groups)
ranks, _ = rankdata_average(pooled)
labels = np.concatenate([np.repeat(name, len(g)) for name, g in zip(group_names, groups)])

fig = go.Figure()
start = 0
for name, g in zip(group_names, groups):
    end = start + len(g)
    fig.add_trace(
        go.Violin(
            y=ranks[start:end],
            name=name,
            box_visible=True,
            meanline_visible=True,
            points=False,
        )
    )
    start = end

fig.add_hline(y=(len(ranks) + 1) / 2, line_dash="dash", line_color="gray", annotation_text="overall mean rank")
fig.update_layout(
    title="Simulation: rank distributions by group",
    xaxis_title="Group",
    yaxis_title="Rank (pooled)",
)
fig.show()
# Permutation distribution of H under the null
fig = go.Figure()
fig.add_trace(go.Histogram(x=H_perm, nbinsx=60, name="H under H0 (permutations)", opacity=0.75))
fig.add_vline(x=H_obs, line_color="crimson", line_width=3, annotation_text="observed H", annotation_position="top")
fig.update_layout(
    title=f"Simulation: permutation distribution of H (p ≈ {p_value:.6f})",
    xaxis_title="H statistic",
    yaxis_title="Count",
)
fig.show()

8) Tie correction demo#

Ties happen when:

  • the measurement is discrete (e.g., integer ratings)

  • values are rounded

  • the data have many repeated values

Tie correction increases \(H\) slightly: ties reduce the variance of the ranks, so \(C < 1\) and \(H_{\text{corr}} = H/C > H\).

A_tied = np.round(A, 1)
B_tied = np.round(B, 1)
C_tied = np.round(C, 1)

details_tied = kruskal_wallis_h(A_tied, B_tied, C_tied)
print("With ties (rounded to 0.1)")
print("H_uncorrected:", np.round(details_tied["H_uncorrected"], 4))
print("tie correction C:", np.round(details_tied["tie_correction_C"], 6))
print("H_corrected:", np.round(details_tied["H"], 4))
With ties (rounded to 0.1)
H_uncorrected: 9.8297
tie correction C: 0.99506
H_corrected: 9.8785
# Compare rank distributions when ties are introduced
pooled_tied = np.concatenate([A_tied, B_tied, C_tied])
ranks_tied, tie_counts = rankdata_average(pooled_tied)

fig = go.Figure()
start = 0
for name, g in zip(group_names, [A_tied, B_tied, C_tied]):
    end = start + len(g)
    fig.add_trace(go.Violin(y=ranks_tied[start:end], name=name, box_visible=True, meanline_visible=True))
    start = end

fig.update_layout(
    title=f"Ranks with ties present (number of tie groups >1: {len(tie_counts)})",
    xaxis_title="Group",
    yaxis_title="Rank",
)
fig.show()

9) How to interpret and report the result#

9.1 What does a significant result mean?#

If the p-value is below your significance level (e.g., \(\alpha=0.05\)), you reject \(H_0\).

Meaning: the data provide evidence that at least one group differs in its distribution (often interpreted as a location/median difference if shapes are similar).

9.2 What you should report#

  • sample sizes per group

  • the statistic: \(H\) (and whether tie-corrected)

  • degrees of freedom: \(k-1\)

  • p-value

  • an effect size (e.g., \(\varepsilon^2\) / epsilon-squared)

Example write-up:

A Kruskal–Wallis test showed a difference between groups, \(H(2)=8.31\), \(p=0.015\), \(\varepsilon^2=0.12\).

9.3 Post-hoc comparisons#

Kruskal–Wallis does not tell you which groups differ. If it’s significant, follow up with pairwise tests (e.g., Mann–Whitney U) and a multiple-comparisons correction (Bonferroni/Holm/Benjamini–Hochberg).
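As a sketch (hypothetical p-values; wiring this to pairwise permutation tests is Exercise 1), the Holm step-down adjustment takes only a few lines of NumPy:

```python
import numpy as np

def holm_adjust(p_values):
    """Holm step-down adjusted p-values (monotone, clipped to 1)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)          # smallest p-value first
    adjusted = np.empty(m)
    running_max = 0.0
    for step, idx in enumerate(order):
        candidate = (m - step) * p[idx]          # multiply by remaining tests
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# e.g. raw p-values from A-B, A-C, B-C pairwise comparisons
print(holm_adjust([0.01, 0.04, 0.03]))  # [0.03 0.06 0.06]
```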

10) Pitfalls and diagnostics#

  • Independence is crucial: correlated/repeated measures invalidate the test.

  • Different shapes/spreads: a significant result may be driven by variance/shape differences, not just median shifts.

  • Many ties: use tie correction (built-in above). If ties are extreme, consider exact/permutation approaches.

  • Omnibus nature: significant does not identify which groups differ; use post-hoc tests.

  • Practical significance: always pair p-values with an effect size and plots.

11) Exercises#

  1. Implement a pairwise post-hoc routine using permutation tests (with Holm correction).

  2. Compare permutation p-values vs the \(\chi^2\) approximation for increasing sample sizes.

  3. Construct a case where group medians are equal but spreads differ. Does Kruskal–Wallis reject?

References#

  • Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis.

  • Conover, W. J. (1999). Practical Nonparametric Statistics.

  • SciPy documentation for scipy.stats.kruskal (useful for verification).